Statistical Issues and Analyses of in vivo and in vitro Genomic Data in order to Identify Clinically Relevant Profiles

In vitro experimentation provides a convenient, controlled environment for testing biological hypotheses of functional genomics in cancer induction and progression. However, it is necessary to validate the gene signatures resulting from these in vitro experiments in human tumor samples (i.e. in vivo). We discuss several methods for integrating data from these two sources, paying particular attention to formulating statistical tests and corresponding null hypotheses. We propose a classification null hypothesis that can be simply modeled via permutation testing. A classification method is proposed based upon the Tissue Similarity Index of Sandberg and Ernberg (PNAS, 2005) that uses the classification null hypothesis. This method is demonstrated using the in vitro signature of Core Serum Response developed by Chang et al. (PLoS Biology, 2004).


Introduction
Integration of in vitro studies, i.e. experimental studies, with human "in vivo" gene expression studies is an area that is being considered more frequently in the functional genomic analysis of cancer. Hypotheses about cancer development, progression, and risk factors are difficult to test directly in a patient population. However, in experimental studies on cell cultures and model organisms, conditions can be specifically controlled to allow biological hypotheses to be tested. Integrating the results from such experiments with in vivo cancer signatures holds the potential both to infer activity of specific oncogenic pathways in vivo and to identify relevant effectors of oncogenic pathways.
To begin to understand the mechanisms by which oncogenes cause cancer, studies have used gene-expression profiling to identify downstream targets of oncogenic pathways in cell-culture systems. Conceptually, this involves manipulating a gene in an in vitro system, measuring the global profile using gene expression technology, and then trying to relate the in vitro gene expression profile to an in vivo gene expression profile. Such an approach was taken by Lamb et al. (2003) to determine the direct transcriptional effects of the oncogene Cyclin D1. In vitro experiments were performed in which Cyclin D1 was both over- and under-expressed, and global gene expression profiles were determined. Lamb et al. (2003) found that there was a significant correlation between the targets found in vitro and the ordered gene list in a human tumor dataset, thus suggesting a role for Cyclin D1 regulation in tumorigenesis. Another example of in vitro/in vivo gene expression data integration appears in the study of Huang et al. (2003). They developed distinct in vitro oncogenic signatures for three transcription factors: Myc, Ras and E2F1-3. These signatures were able to predict the Myc and Ras state in mammary tumors that developed in transgenic mice expressing either Myc or Ras, suggesting that specific oncogenic events are encoded in global gene-expression profiles.
Additionally, studies have used gene-expression profiling of cancerous growths induced in model organisms to examine tumor development or progression. Though model organism studies have the added difficulty of mapping orthologous genes between organisms, a difficulty not shared with tissue and cell cultures of human origin, there have been promising applications. For example, Sweet-Cordero et al. (2005) defined a KRAS induced lung cancer signature by comparing lung tumors generated from a spontaneous KRAS mutation mouse model to normal mouse lung tissue. They then correlated this KRAS lung cancer signature with gene expression profiles in human lung cancer studies and found that the mouse signature shared significant similarity with human lung adenocarcinoma but not with other lung cancer types. Next, Sweet-Cordero et al. (2005) looked for evidence of the KRAS signature in human tumors carrying activating KRAS mutations relative to wild-type tumors. Although no individual genes were significantly associated with the KRAS mutation status in human tumors, the mouse KRAS signature was significantly enriched among genes rank-ordered by differential expression in human tumors with a KRAS mutation.
It is expected that in vitro/in vivo experiments such as those described in the previous two paragraphs will become much more commonplace in the future. Thus, it is critical to address statistical issues and to develop methods for integrating in vivo and in vitro genomic data so that inferences regarding transcriptional regulatory pathways in cancer can be generated. In this article, we discuss the statistical issues of the integration of these two types of datasets. We review various existing approaches and discuss their statistical advantages and disadvantages. In addition, we outline an approach for quantifying the predictive ability of a gene expression profile determined from an in vitro experiment based on the tissue similarity approach of Sandberg and Ernberg (2005). We describe the application of the proposed methodology using in vitro data from a wound healing study conducted by Chang et al. (2004) and in vivo data from Glinsky et al. (2004), van't Veer et al. (2002), and Beer et al. (2002). Finally, we conclude with some discussion.

Background and Review
One class of methods that has been popular in the literature for in vitro/in vivo genomic data analysis is the following. First, one generates ordered lists of genes using the in vivo expression data. One then generates a differentially expressed gene list using the in vitro data and studies the overlap between the two lists. The seminal examples of this are in Mootha et al. (2003) and Lamb et al. (2003), which were then used as the basis of the Gene Set Enrichment Analysis (GSEA) method. We describe the GSEA methodology by briefly reviewing what was done in the Lamb et al. (2003) study. First, a list of differentially expressed genes was generated based on the comparison of Cyclin D1 overexpressing relative to wild-type (no Cyclin D1 manipulation) mammary epithelial cell lines. Next, each gene's expression in vivo, from 190 human tumor samples of various origins, was correlated to that of Cyclin D1 and the genes were ranked accordingly. Then, a Kolmogorov-Smirnov (KS) statistic was used to determine if the in vitro differential expression list clustered within the correlation-ordered in vivo list. Since there was significant evidence of clustering, Lamb et al. (2003) determined that the in vitro-defined targets of Cyclin D1 were correlated with their respective levels in vivo. This suggests that the direct regulatory effects of Cyclin D1 may play an important role in tumorigenesis.
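The enrichment step just described can be illustrated with a small sketch. The code below is a minimal, unweighted KS-type running-sum statistic written in plain Python; it is not the GSEA software, and the function name `ks_enrichment` and the toy gene labels in the usage note are hypothetical. It assumes the in vivo genes have already been ranked (for example by correlation with Cyclin D1) and that the in vitro gene list is supplied as a set.

```python
from math import sqrt


def ks_enrichment(ranked_genes, gene_set):
    """Largest deviation from zero of an unweighted KS-type running sum.

    Walking down the ranked list, the sum rises at each gene that belongs
    to the set and falls at each gene that does not. The step sizes are
    chosen so that the sum returns to zero at the end of the list.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    up = sqrt(n_miss / n_hit)      # step taken at a gene in the set
    down = sqrt(n_hit / n_miss)    # step taken at a gene outside the set
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += up if is_hit else -down
        if abs(running) > abs(best):
            best = running
    return best
```

For a toy ranking `["a", "b", "c", "d"]`, a set concentrated at the top such as `{"a", "b"}` gives a large positive value, while a set concentrated at the bottom gives a large negative value. Significance would still have to be assessed against an appropriate null distribution, which is the issue taken up in the remainder of this section.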
There are some desirable features of the GSEA method. First, it utilizes all the information available in the in vivo gene expression data; no thresholding is done in that dataset. Second, a Kolmogorov-Smirnov statistic is used for the analysis, which is a non-parametric method and thus provides some robustness. However, there are several disadvantages to GSEA as well. For instance, note that there is thresholding done in the in vitro gene expression dataset to select the differentially expressed gene set. A potential improvement to the GSEA method, to avoid this thresholding, would be the following. First, one determines the common genes in the in vivo and in vitro datasets. One then takes the scores of differential expression from the in vitro data, finds the corresponding correlation scores (correlation with Cyclin D1) in the in vivo data and examines a scatterplot of the two variables. If the association is linear, then one tests for association using the Pearson correlation coefficient between the two variables. If instead the association appears nonlinear, then one could use a smoothing-spline based test (Lin, 1997). Such an approach would give a direct test of association between the correlations in vivo and the differential expression measurement in vitro without requiring thresholding of any datasets and would still allow for a nonlinear relationship between the two variables.
Before going further, let us consider the null hypothesis under consideration in the GSEA method, or the variants proposed above. Specifically, in the Lamb et al. (2003) study they test:
$H_0$: There is no association between differential expression of Cyclin D1-overexpressed, relative to non-overexpressed, cell lines and correlation with Cyclin D1 in human tumors.
The alternative hypothesis is that there is an association. In specifying the null hypothesis we uncover a more subtle disadvantage of the GSEA method: the determination of the distribution of the KS test statistic under the null hypothesis. Two variants of permutation testing have been proposed by Subramanian et al. (2005) to elucidate the distribution of the KS test statistic assuming the null hypothesis is true. In the first, the sample labels in the in vitro data are permuted, the differentially expressed gene signature is redefined, and the Kolmogorov-Smirnov statistic is recomputed based on this new signature; see Figure 1a, red.
Here the implication is that the correlation between the two Cyclin D1 levels in the cell line experiment is removed by the permutation. However, this addresses the differential expression in the in vitro samples but does not address a null association with the in vivo samples. In the second version, the sample labels in the in vitro and in vivo datasets are permuted, both the in vitro differential expression signature and the in vivo correlations are redefined, and the Kolmogorov-Smirnov statistic is recomputed; see Figure 1a, blue. Again, the implication is to remove the association within the in vitro and in vivo experiments. Yet this permutation scheme still does not address the association between the in vitro differential expression and the in vivo correlation. The role of permutation testing is to simulate the distribution of the test statistic assuming that $H_0$ is true; however, the two permutation schemes developed in the GSEA method do not do this. Permutation of the sample labels fails because the null hypothesis pertains to the population of genes in the two studies and not to the relation of samples within a study. Additionally, Shedden (2004) suggests that permuting the sample labels of both the in vitro and in vivo data sets is not appropriate. Simply put, if the permutation does not correctly model the null hypothesis, then we are answering a different question than the one asked.
There is an alternative approach to the GSEA method for integrative analysis of in vitro and in vivo data, which is what we focus on in the rest of the paper. It is based on ideas of classification and clustering, since the goal in many genomic studies utilizing high-throughput expression technologies is to develop a signature that can discriminate between relevant classes or groups of samples. In general, demonstration of the predictive or prognostic ability of a classification signature on independent data sets is a crucial step in the validation of that signature (Ransohoff, 2004). Expression signatures discovered in vitro are often "validated" on independent in vivo data sets, such that the in vitro data is the training dataset and the in vivo data is the testing dataset. In this validation setting, the null hypothesis that we wish to test is the following:
$H_0^{\text{class}}$: There exists no set of genes derived from the in vitro gene expression dataset that can predict clinical outcome in the in vivo expression data.
The alternative is that at least one set of genes derived from the in vitro data is predictive. Notice that this null hypothesis is different from the null hypothesis described for the GSEA method. For clarity, we will refer to $H_0^{\text{class}}$ as the classification null hypothesis.
An advantage of the classification null hypothesis is that permutation testing becomes possible here. In particular, if $H_0^{\text{class}}$ is true, then any set of genes derived from the in vitro expression profile data will have no ability to separate samples in the in vivo expression dataset with regard to a clinical outcome. Thus, we can take random sets of genes from the in vitro data and apply the classification algorithm of interest. If the classification null hypothesis is true, then all sets of genes, including the derived signature, should provide equal prediction performance.
The classification null hypothesis has motivated the following algorithm that we have used in our previous work (Varambally et al. 2005). Here we are considering the genes common to the in vitro and in vivo expression datasets.
1. Derive a gene signature from the in vitro gene expression data.
2. Select those genes from the in vivo expression data that are included in the in vitro signature and cluster the samples from the in vivo expression data into two groups using hierarchical clustering with average linkage and Euclidean distance.
3. Calculate the log-rank statistic for survival between the two groups of patients.
4. Let L denote the size of the gene list in step 1. Randomly choose L genes from the in vitro data as the gene signature. Continue with steps 2 and 3 above.
5. Repeat steps 2-4 1000 times. Calculate the proportion of datasets in which the log-rank statistic is greater than the one calculated initially from the signature in step 1.
The proportion calculated in step 5 will be the permutation p-value under the classification null hypothesis. This permutation scheme will form the basis of assessing significance for our proposed analytical scheme described in the next section. We note that one could also modify the GSEA procedure in a similar way, as shown in Lamb et al. (2003), such that we randomly draw the gene set from the in vitro data rather than assessing differential expression based on permuted sample labels. Unfortunately, Shedden (2004) shows that when one does not account for gene-gene correlation, the resulting test statistic can be too liberal by as much as 10 times.
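The random gene set scheme in steps 1-5 can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the function name `random_gene_set_pvalue` is hypothetical, and the `statistic` argument stands in for steps 2 and 3 (clustering the in vivo samples on the chosen genes and computing the log-rank statistic), which the caller must supply.

```python
import random


def random_gene_set_pvalue(signature, common_genes, statistic,
                           n_perm=1000, seed=0):
    """Empirical p-value of a signature against random gene sets of equal size.

    `statistic` maps a list of genes to a number, where larger values mean
    better separation of the in vivo samples. It is an assumed, caller-supplied
    stand-in for the clustering and log-rank steps.
    """
    rng = random.Random(seed)
    pool = list(common_genes)           # genes common to both datasets
    observed = statistic(list(signature))
    size = len(signature)               # L in step 4
    exceed = 0
    for _ in range(n_perm):
        random_set = rng.sample(pool, size)
        if statistic(random_set) > observed:
            exceed += 1
    # Proportion of random sets that beat the signature (step 5).
    return exceed / n_perm
```

Passing the statistic in as a function keeps the resampling logic separate from the choice of classifier, so the same loop could be reused with a different classification algorithm or outcome measure.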
Notice that a limitation of the classification null hypothesis is that the alternative hypothesis states that there exists at least one signature from the in vitro expression data that is predictive in the in vivo expression data. In fact the experimentally derived gene list need not be a unique classifier. It has been recently noted that there are likely many gene signatures that have similar predictive power (Ein-Dor et al. 2005; Fan et al. 2006). This may be due in part to genetic redundancy or to the high correlation of genes within a pathway. Yet if the in vitro gene signature is able to predict prognosis better than a randomly selected set of genes, we expect that there is biological significance to that signature. Thus permutation testing helps us to determine if the gene set derived from the in vitro experiments is of interest for further study of its biological relevance.

Proposed methodology for in vitro/in vivo analyses
The paper of Sandberg and Ernberg (2005) considers the relationship between the gene expression of in vitro cell cultures and their respective in vivo tumor samples. To that end they developed an algorithm for comparing gene expression values across experiments that they call the tissue similarity index (TSI). We use that algorithm here to compare the in vivo tumor samples to the in vitro samples of a lab experiment.
The algorithm of Sandberg and Ernberg (2005) is as follows; see Figure 1b. Principal component analysis is run on the covariance matrix of gene expression for genes in the in vitro dataset. Data are scaled across arrays so that each gene has a mean expression of zero and a unit standard deviation. The resulting eigenarrays (eigenvectors) are stored. To project the in vitro gene expression into the reduced dimensional space created by the eigenarrays, calculate the correlation between each eigenarray and each in vitro sample array. The consensus signature for each experimental condition (serum induced and serum independent) is represented by its median centroid in the reduced space.
To integrate the in vivo data, first map the in vivo samples into the same reduced space as the in vitro samples by again calculating the correlation between each eigenarray and in vivo sample array. To maintain scale in this correlation, the tumor samples are also standardized so that each gene has a mean expression of zero and a unit standard deviation. The distance between the in vivo tumor sample and each of the two consensus signatures, i.e. centroids, is calculated using Pearson correlation. Samples are classified with the experimental condition with whose centroid they correlate best.
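The projection and nearest-centroid steps described above can be sketched as follows. This is a simplified illustration, not the implementation of Sandberg and Ernberg (2005): it assumes the eigenarrays have already been obtained from a principal component analysis of the standardized in vitro data, and the function names (`pearson`, `project`, `median_centroid`, `classify`) are hypothetical.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def project(sample, eigenarrays):
    """Coordinates of one sample: its correlation with each eigenarray."""
    return [pearson(sample, e) for e in eigenarrays]


def median_centroid(points):
    """Coordinate-wise median of a group of projected samples."""
    return [median(coords) for coords in zip(*points)]


def classify(sample, eigenarrays, centroids):
    """Label of the centroid that correlates best with the projected sample.

    `centroids` maps a condition label to its centroid in the reduced space.
    """
    coords = project(sample, eigenarrays)
    return max(centroids, key=lambda label: pearson(coords, centroids[label]))
```

In this sketch the centroids for the two experimental conditions would be built by applying `project` to each in vitro array and then `median_centroid` to the projected arrays of each condition. At least three eigenarrays are needed for the correlation between a projected sample and a centroid to be informative.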
There are several differences between their implementation of TSI and ours. First, in contrast to Sandberg and Ernberg (2005), we use statistical significance of a positive TSI to determine classification, thus allowing some samples to remain unclassified. In their paper they used an ad hoc threshold value for the TSI score, delineating moderate and high correlation groups. It is natural to believe that some of the in vivo samples will not correlate well with the in vitro conditions. These unclassified samples may actually be informative in that they define a subset of cases which do not meet our expectation as developed in the hypotheses tested in vitro. Second, the goal of the Sandberg and Ernberg (2005) paper was qualitative assessment of cell line gene expression relative to in vivo tumor gene expression; thus they do not address the issue of statistical significance of their method. However, the classification provided by the TSI can be tested for prognostic or diagnostic value depending upon the study goal.
Since the gene signature on which the classification is based is determined from the in vitro data, and does not use the in vivo data, the statistical significance of any tests on the in vivo data can be accepted without bias. This is an example of using the in vitro data for the training dataset and the in vivo data for the testing dataset. Indeed, if this in vivo validation is not at least marginally significant it is not of interest to proceed further to test the classification null hypothesis.
Using the TSI method, we develop a classification scheme from the in vitro signature. The null hypothesis of interest is again the classification null hypothesis as presented above. We thus propose the use of a permutation test to determine the utility of the gene signature in its classification ability. In the following we slightly modify the permutation test procedure described in the previous section to account for gene-gene correlation within the in vitro gene signature. Specifically, as it is likely that genes within a pathway are correlated, it is reasonable to assume that the significantly differentially expressed genes that comprise the in vitro signature are correlated. Shedden (2004) showed that this correlation can lead to invalid p-values. In the classical genetics setting, Nyholt (2004) shows that permutation tests that do not account for this correlation can be misleading and proposes a simple adjustment. In essence, rather than randomly selecting L genes in each cycle of the permutation test, only M (M ≤ L) genes are selected, where M is calculated to be the effective number of independent genes in the gene signature (Nyholt, 2004); see Figure 1b.
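For reference, the adjustment of Nyholt (2004) estimates the effective number of independent variables from the eigenvalues of their correlation matrix as Meff = 1 + (L - 1)(1 - Var(eigenvalues)/L), where L is the number of variables. The sketch below is one way to compute this quantity, assuming the L x L gene-gene correlation matrix of the signature is available as a list of rows; the function name `effective_number` is hypothetical, and how the result is rounded to an integer M is not specified in the text and is left to the reader.

```python
def effective_number(corr):
    """Effective number of independent variables from a correlation matrix.

    Uses Meff = 1 + (L - 1) * (1 - Var(eigenvalues) / L). For a symmetric
    correlation matrix the eigenvalues sum to L (the trace) and their squares
    sum to the squared Frobenius norm, so the sample variance of the
    eigenvalues can be computed without an eigendecomposition.
    """
    n = len(corr)
    if n < 2:
        return float(n)
    frob_sq = sum(r * r for row in corr for r in row)
    # Sample variance of the eigenvalues; their mean is exactly 1.
    var_eig = (frob_sq - n) / (n - 1)
    return 1 + (n - 1) * (1 - var_eig / n)
```

Two limiting cases serve as a check: for mutually uncorrelated genes (identity matrix) the function returns L, and for perfectly correlated genes (all entries equal to one) it returns 1.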
Finally, the permutation test for the TSI analysis has two interesting attributes against which the classification signature is compared. Specifically, in permuting the data, the TSI scores are recalculated using the randomly selected gene list, and with each randomly selected set of genes there is a possibility of unclassified samples. Thus the classification is compared to: (1) the measure of association with predictive factors in vivo, and (2) the percentage of unclassified samples in vivo.

Data acquisition and preparation
For the purpose of demonstration we use, as the in vitro derived signature, the wound healing signature of Chang et al. (2004). Derived from cultured fibroblasts in the presence and absence of serum components, the wound healing signature is composed of 573 genes that are differentially expressed in response to serum. We consider the wound healing signature, or Core Serum Response (CSR), as the in vitro basis for classifying three sets of in vivo tumor samples into good and bad prognosis groups: prostate tumor samples (Glinsky et al. 2004), breast tumor samples (van't Veer et al. 2002), and lung tumor samples (Beer et al. 2002).
The fibroblast gene expression data (Chang et al. 2004) were downloaded from the Stanford Microarray Database (SMD, http://smd.stanford.edu/cgi-bin/publication/viewPublication.pl?pub_no=293) (platform: cDNA microarray, 50 samples). The data were normalized using loess normalization by print block within array (Yang et al. 2002). Interarray variability was accounted for by scaling using the MAD (median absolute deviation). Missing data were imputed using KNN (K-nearest neighbors) imputation as implemented in the pam.r package (Hastie et al. 1999; Troyanskaya et al. 2001).
Localized prostate tumor probe-set level expression measures and recurrence free survival information (Glinsky et al. 2004)
Unigene Cluster ID number was used to map genes between platforms. Annotation information was acquired from SOURCE (Diehn et al. 2003). If, for a given platform, multiple measurements were represented by the same Unigene Cluster ID, these expression values were averaged within array, thus allowing one-to-one mapping of genes between platforms. Genes were mapped to Unigene Cluster ID from GenBank Accession number if available (Chang et al. 2004; Glinsky et al. 2004; van't Veer et al. 2002) or from Unigene Symbol (Beer et al. 2002).
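The within-array averaging and the restriction to common genes might look like the following sketch. It is illustrative only: the function names `collapse_by_cluster` and `common_genes` are hypothetical, the identifiers in the usage note are made-up placeholders, and measurements without a Unigene Cluster ID are assumed to be dropped.

```python
from collections import defaultdict
from statistics import mean


def collapse_by_cluster(values, cluster_ids):
    """Average, within one array, all measurements sharing a cluster ID.

    Measurements whose ID is None (no Unigene mapping) are dropped.
    Returns a dict mapping cluster ID -> averaged expression value.
    """
    groups = defaultdict(list)
    for value, cid in zip(values, cluster_ids):
        if cid is not None:
            groups[cid].append(value)
    return {cid: mean(vals) for cid, vals in groups.items()}


def common_genes(*collapsed_arrays):
    """Sorted cluster IDs present in every collapsed array or platform."""
    ids = set(collapsed_arrays[0])
    for arr in collapsed_arrays[1:]:
        ids &= set(arr)
    return sorted(ids)
```

For example, `collapse_by_cluster([1.0, 3.0, 5.0], ["Hs.1", "Hs.1", "Hs.2"])` averages the two measurements labeled `"Hs.1"`, giving one value per cluster ID and hence a one-to-one mapping between platforms.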

Application of the TSI based classifier
The classifier was built using the CSR in vitro signature and the TSI algorithm, described in the previous section and in Figure 1b. The classifier was built for each of the three in vivo experiments using only those genes in the CSR signature that were common to both the in vivo and in vitro experiments; see Table 1 and Figure 2. All 50 eigenarrays were used for the TSI classification algorithm, and classification is based on significant positive correlation with one of the two CSR group centroids. Figure 3 plots the first two dimensions of this reduced space for each of the three tumor types. The cell cultures that were grown in the presence of serum were considered to be serum induced, whereas those grown without serum components were serum independent. In vivo samples that correlate significantly (p < 0.05) with the composite serum induced signature, i.e. centroid, are classified as serum induced. Likewise, those in vivo samples correlating significantly with the centroid of the serum independent samples are labeled serum independent. In vivo samples that do not correlate significantly with either centroid remain unclassified. In Figure 3, the tumor samples are colored according to their classification and the in vitro samples and centroids are included for reference.
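The three-way decision rule (serum induced, serum independent, or unclassified) can be sketched as below. The text does not state how the significance of the correlation was computed, so the one-sided Fisher z approximation used here is an assumption made purely for illustration, as are the function names `positive_corr_pvalue` and `classify_with_threshold`.

```python
from math import atanh, sqrt
from statistics import NormalDist


def positive_corr_pvalue(r, n):
    """One-sided p-value for a positive correlation.

    Assumption: Fisher z approximation with n coordinates, which is not
    necessarily the test used in the analysis described above.
    """
    if n <= 3:
        raise ValueError("need more than 3 coordinates")
    r = max(min(r, 0.999999), -0.999999)   # keep atanh finite
    z = atanh(r) * sqrt(n - 3)
    return 1.0 - NormalDist().cdf(z)


def classify_with_threshold(correlations, n, alpha=0.05):
    """Assign the best-correlated class if that correlation is significantly
    positive; otherwise return 'unclassified'.

    `correlations` maps class label -> Pearson r between the projected sample
    and that class centroid; `n` is the number of eigenarrays used.
    """
    label = max(correlations, key=correlations.get)
    if positive_corr_pvalue(correlations[label], n) < alpha:
        return label
    return "unclassified"
```

With 50 eigenarrays, a sample whose correlations with the two centroids are 0.5 and -0.4 would be assigned to the first class under this rule, whereas correlations of 0.1 and 0.05 would leave the sample unclassified.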
Table 1. Genes were matched between platforms using Unigene ID numbers. For the permutation testing, all common genes between the in vitro experiment (Chang et al. 2004) and each in vivo experiment were considered (prostate: Glinsky et al. 2004; breast: van't Veer et al. 2002; lung: Beer et al. 2002). The classification of a set of in vivo samples was done based on only the CSR genes identified in that data set. Permutation sample size was determined based on the effective number of independent genes in the CSR signature.
According to $H_0^{\text{class}}$, we wish to see if the in vitro derived CSR signature has prognostic ability in vivo. Thus the prognostic ability of the CSR signature as a classifier was tested using univariate Cox regression; see Table 2. The TSI score was incorporated through its discrete classification of the in vivo samples, as described above. Figure 4 contains the Kaplan-Meier survival curves for this discrete classification. The red and blue curves represent the serum activated and serum independent classifications, respectively. Log-rank statistics on the Kaplan-Meier estimates indicate that there is a significant separation between the curves for the prostate tumors (p < 0.0001), the breast tumors (p = 0.0207), and the lung tumors (p = 0.0352). The tan curve shows the survival of those samples that did not significantly correlate with either the serum activated or serum independent profiles and are thus left unclassified by the TSI algorithm. When this unclassified group was included in the log-rank test of survival curve separation, the prostate cancer and breast cancer samples remained significant (p < 0.0001 and p = 0.0078, respectively) whereas the lung cancer samples were marginally significant (p = 0.0789).
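For readers unfamiliar with the log-rank comparison, the following is a minimal two-group sketch in plain Python. It is not the software used for the analysis above, and the function name `logrank_chi2` is hypothetical. It handles two groups only; including the unclassified samples as a third group would require the multi-group version of the test.

```python
def logrank_chi2(times, events, groups):
    """Two-group log-rank chi-squared statistic (1 degree of freedom).

    times:  follow-up time per sample
    events: 1 if the event (e.g. recurrence or death) was observed,
            0 if the sample was censored
    groups: 0 or 1 per sample
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # Samples still under observation just before time t.
        at_risk = [(u, e, g) for u, e, g in zip(times, events, groups) if u >= t]
        n = len(at_risk)
        n1 = sum(1 for _, _, g in at_risk if g == 1)
        d = sum(1 for u, e, _ in at_risk if e and u == t)
        d1 = sum(1 for u, e, g in at_risk if e and u == t and g == 1)
        # Observed minus expected events in group 1 at this time point.
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the group 1 event count.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("statistic undefined: only one group at risk")
    return obs_minus_exp ** 2 / variance
```

The statistic is undefined when all samples fall into a single group, which corresponds to the situation noted later for some randomly drawn gene sets.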

Permutation testing of $H_0^{\text{class}}$
Accepting the above significant separation of the Kaplan-Meier curves as validation of the CSR signature in vivo, we proceed to test the classification null hypothesis using 1000 random samples from the genes in common between the in vivo and in vitro samples; see Table 1. The size of the randomly drawn set of genes was determined by the correlation in the original CSR genes, such that the randomly drawn sets contained a number of effectively independent genes equivalent to that of the CSR set. The TSI score was recalculated on each of these 1000 random gene sets. It was then used to classify the in vivo samples and predict survival. Figure 5 depicts the classification and prediction ability of the 1000 random sets for each of the three in vivo data sets. The CSR gene predictor is colored red in these plots. The vertical axis plots the predictive ability of the gene set as the chi-squared test statistic associated with univariate Cox regression on the classifier. If we look at the vertical margin we arrive at the permutation p-value as depicted by the marginal histogram. However, we have additional information about the utility of the CSR signature as a classifier. The horizontal axis provides the percentage of the samples that remained unclassified in each of the 1000 random sets. In each case, the classifier based on the CSR genes has a lower percentage of unclassified samples than any of the randomly drawn gene sets. Finally, note that for some of the randomly drawn gene sets, see Figures 5B and 5C, the samples were classified into only one group and thus the chi-squared test statistic could not be calculated. This occurred when the percentage of unclassified samples was high.

TSI based classification shows heterogeneity among samples
As depicted in Figure 2, the simple dichotomization of in vivo samples by hierarchical clustering is far from optimal. By the nature of hierarchical clustering, dichotomization can be achieved by splitting samples at the first node. In Figure 2 we have color coded the samples by their TSI predicted classification (red = serum activated, blue = serum independent, tan = unclassified) and we see that there is heterogeneity in the classification suggested by dichotomization at the first node of the dendrogram. This heterogeneity is apparent in the Kaplan-Meier plots of Figure 4. Notice that the prostate samples appear to be least heterogeneous, see Figure 2B, in that most of the serum activated samples are clustered on the left and most of the serum independent samples are clustered on the right with the unclassified samples interspersed among both branches. The Kaplan-Meier plot in Figure 4A suggests that those samples which can be classified by their serum response have the best and worst recurrence free survival, with the unclassified samples having intermediate recurrence free survival. The intermediate nature of the unclassified samples may be due to a third class of tumors with moderate serum response or it may be due to a blending of high risk and low risk samples that were not separated by the CSR signature. The breast cancer samples appear to have a more well defined subset of unclassified samples; see Figure 2c. The far right branch of the dendrogram (as split on the second node) contains a high percentage of unclassified samples. In Figure 4b, the unclassified samples are associated with a recurrence free survival curve that is worse than for the serum activated samples. In the lung cancer data it is not clear that classification on any of the first three nodes of the dendrogram would result in homogeneous classification based on the CSR signature; see Figure 2d. However, using the TSI classification we are able to significantly split the samples into good and bad prognosis groups based on overall survival; see Figure 4c.

Differences in array configurations may reduce utility of the in vitro signature
One problem encountered in this analysis was the integration of gene expression data across microarray platforms. We attempted to compensate for this numerically by global standardization that centered the array-wise median values at zero. Furthermore, in the TSI algorithm genes were standardized to zero mean and unit standard deviation before being mapped into the reduced space. An additional complication, beyond numerical scaling, is that the differing array configurations between the in vitro and in vivo experiments mean that only those genes with Unigene ID numbers common to both data sets can be considered. This initially excludes ESTs from the in vitro signature as well as other features that do not have Unigene ID numbers. The signature is further reduced by focusing on only the common genes between data sets as determined by Unigene ID. We expect that there is correlation between the genes within the CSR signature and thus the loss of some genes from this signature will be tolerable.
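The two numerical standardizations mentioned above (array-wise median centering and gene-wise scaling) can be sketched as follows. The function names are hypothetical, the data layout (a list of arrays, each a list of values in a common gene order) is an assumption, and the use of the sample standard deviation is likewise an assumption since the text does not specify it.

```python
from statistics import mean, median, stdev


def center_arrays(matrix):
    """Subtract each array's median so that array-wise medians are zero.

    `matrix` is a list of arrays (samples), each a list of expression
    values in the same gene order.
    """
    centered = []
    for array in matrix:
        med = median(array)
        centered.append([v - med for v in array])
    return centered


def standardize_genes(matrix):
    """Scale each gene (column) to mean 0 and standard deviation 1 across
    arrays. Genes with zero variance would need to be removed beforehand."""
    columns = list(zip(*matrix))
    stats = [(mean(col), stdev(col)) for col in columns]
    return [[(v - m) / s for v, (m, s) in zip(array, stats)]
            for array in matrix]
```

Both functions return new lists and leave the input unchanged, so the same raw matrix can be standardized separately for each in vivo dataset.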
The most dramatic decrease in CSR genes available for the analysis was for the Beer et al. (2002) lung samples, which measured only 32.6% of the 484 Unigene mapped CSR genes; see Table 1. It is possible that the high observed percentage of unclassified samples, 51.2%, is related to this diminished in vitro signature. Also, notice in Figure 3c that the mapping of the in vitro samples into the reduced space appears to have flipped about the horizontal axis from what we saw for the other two in vivo data sets. Since the reduced space is determined by the in vitro data, we expect that this inversion is a result of the diminished in vitro signature. However, this inversion does not affect the association of the classification with prognosis. As shown in Figure 4c, the serum induced class has worse overall survival than the serum independent class, as expected. This change in the reduced space mapping highlights the necessity to calculate the TSI classifier independently for each in vivo dataset, or particularly for each different array platform and configuration used by the in vivo experiments.

Table 2. Cox regression was run on the samples that were classified as serum induced or serum independent by the CSR gene signature. Unclassified samples are excluded from this analysis. The hazard ratios are relative to the serum independent classification.

Poisson and Ghosh

Permutation plots provide a complete view of the null distribution
Finally, we turn our attention to the permutation testing depicted in Figure 5. These three plots carry a lot of interesting information regarding the utility of the CSR gene signature as a predictor of survival among the three tumor types. First, consider the horizontal axes of Figures 5a-c. It is intriguing that for all three tumor types the CSR signature has the lowest percentage of unclassified samples. Yet we see that the percentage of classified samples is not the sole predictor of significant separation in the survival curves, since there are randomly selected gene sets that have higher percentages of unclassified samples but also have higher test statistics.
Next, consider the empirical p-value for testing $H_0^{\mathrm{class}}$. In the prostate samples, Figure 5a, the empirical p-value is 0.0040, whereas the p-value obtained from a simple training/testing strategy is very small (chi-squared test statistic = 20.89, p < 0.0001). In fact, from the scale on the vertical axis we see that most of the random permutation samples were able to predict a significant separation in the survival of the prostate cancer patients. Thus, had we relied only on the training/testing strategy, we could not have determined that the CSR signature is superior to 99.6% of the randomly selected signatures. The range of the test statistics for the breast cancer and lung cancer samples is less dramatic. In fact, the empirical p-value for the lung cancer dataset behaves as we would normally expect, showing that a minimally significant test statistic in the training/testing setting (p = 0.0352) is indeed superior to test statistics generated under the classification null hypothesis (empirical p = 0.0111).
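The empirical p-value discussed above can be illustrated with a minimal sketch. It assumes that the null distribution is built from randomly selected gene sets of the same size as the in vitro signature, and that the empirical p-value is the proportion of random sets whose test statistic is at least as large as the observed one. The function name `empirical_p_value`, its parameters, and the toy statistic are hypothetical and are not taken from the original analysis; in practice the statistic would be the survival test statistic obtained after classifying the in vivo samples with each gene set.

```python
import random


def empirical_p_value(observed_stat, statistic, all_genes, signature_size,
                      n_permutations=1000, seed=0):
    """Proportion of random gene sets with a statistic >= the observed one.

    `statistic` is any callable that maps a list of genes to a test
    statistic (hypothetical interface; e.g. the chi-squared statistic of a
    survival comparison after classifying the in vivo samples).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    exceed = 0
    for _ in range(n_permutations):
        # Draw a random gene set of the same size as the real signature.
        random_set = rng.sample(all_genes, signature_size)
        if statistic(random_set) >= observed_stat:
            exceed += 1
    return exceed / n_permutations


# Toy illustration only: genes are integers and the "statistic" is their sum.
genes = list(range(100))
p_never_exceeded = empirical_p_value(10**9, sum, genes, signature_size=5)
p_always_exceeded = empirical_p_value(0, sum, genes, signature_size=5)
```

Under this convention, an empirical p-value of 0.0040 corresponds to a signature whose statistic is exceeded or matched by only 0.4% of the random gene sets. Some authors add one to both the numerator and the denominator so that the estimate is never exactly zero; that variant is not shown here.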

Further thoughts on hypothesis testing
Here we have discussed the nature of hypothesis testing when integrating gene expression signatures derived from hypothesis-driven in vitro studies with gene expression profiles of in vivo tumor samples. Assigning significance to classification results and associations is necessary to evaluate the utility of the in vitro signature in cancer development and progression as found in vivo. However, for accurate assessment of significance it is necessary to consider the underlying null hypothesis that is being tested. Though the permutation test has been widely accepted as a panacea for significance testing, we discussed how the permutation must be done with care so that the underlying null distribution is appropriately reconstructed. We provide a method of assessing the classification potential of an in vitro signature using the TSI classifier of Sandberg and Ernberg (2005). The classification null hypothesis underlies tests of this classifier, and thus permutation sampling is available for construction of the null distribution of test statistics.
Although we have discussed the value of a well defined null hypothesis, the interpretation of the alternative hypothesis comes into play when the null is rejected. In particular, the alternative hypothesis for the classification null hypothesis is that at least one predictive signature exists. The number of such signatures is not known and thus the signature being tested need not be unique. We know from Fan et al. (2006) that there are likely to be several predictive signatures in any gene expression study. These may be biologically similar but need not share a substantial number of genes. Rejection of the classification null hypothesis only provides evidence for the existence of one or more predictive signatures. In fact, a significant empirical p-value suggests only superiority of the in vitro gene list above the randomly generated gene lists of comparable size. If other in vitro hypotheses were tested, the gene signatures generated may also be predictive. Yet, we argue that regardless of its uniqueness, any gene signature that rejects the classification null hypothesis is worthy of further study of its biological relevance.

Further thoughts on thresholding
We have remarked at several points about the use of thresholding in the algorithms. Arbitrary thresholds used to select genes that are interesting biologically or signatures that are significant statistically may not always be satisfying. We briefly discussed how the concept of GSEA could be adapted to a regression model that would not require a strict definition of the in vitro gene set of interest. Yet, the TSI-type algorithm that we proposed still used thresholding of the in vitro data to produce a signature. This is not necessary for the sake of the algorithm, and classification should still be possible using the entire gene signature of the in vitro samples. Ultimately the classifier is built on the correlation of the in vivo samples to composite signatures for each experimental condition in the reduced space.
We again used thresholding in the classification of the in vivo samples by requiring a significant correlation with one of the experimental centroids. The threshold for significance was left at the typical level of p < 0.05, although this could be adjusted to achieve desired specificity and sensitivity in the classifier by dividing the in vivo samples into training and test sets and examining various thresholds. Alternatively, the correlation score could be used as a continuous variable in the Cox regression models. In this way those samples that were not classified in the dichotomous classification would contribute to the model.
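The thresholded classification described above might be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the analysis: each in vivo sample and each experimental centroid is assumed to be a vector of coordinates in the reduced space, the significance of a Pearson correlation is approximated with the Fisher z-transformation (an assumption of this sketch; the exact test is not specified here), and a sample is assigned to the centroid with the largest significant positive correlation. The names `pearson`, `correlation_p_value`, and `classify` are hypothetical.

```python
import math
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_p_value(r, n):
    """Two-sided p-value via the Fisher z approximation (requires n > 3)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def classify(sample, centroids, alpha=0.05):
    """Return (label, scores); label is 'unclassified' if nothing is significant.

    `scores` holds the correlation with every centroid, so it can also be
    used as a continuous covariate instead of the dichotomous label.
    """
    scores = {label: pearson(sample, c) for label, c in centroids.items()}
    label, best_r = "unclassified", 0.0
    for name, r in scores.items():
        # Only significant positive correlations can claim the sample.
        if r > best_r and correlation_p_value(r, len(sample)) < alpha:
            label, best_r = name, r
    return label, scores


# Toy centroids and samples (made-up numbers for illustration only).
centroids = {
    "serum induced": [1, 2, 3, 4, 5, 6, 7, 8],
    "serum independent": [8, 7, 6, 5, 4, 3, 2, 1],
}
label_a, scores_a = classify([1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.0, 8.1], centroids)
label_b, scores_b = classify([1, -1, 1, -1, 1, -1, 1, -1], centroids)
```

Changing `alpha` corresponds to adjusting the significance threshold discussed above, while the returned `scores` retain information for samples that the dichotomous rule leaves unclassified.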